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Abstract 

Background: ATP is a ubiquitous nucleotide that provides energy for cellular activities, catalyzes chemical 
reactions, and is involved in cellular signalling. The knowledge of the ATP-protein interactions helps with 
annotation of protein functions and finds applications in drug design. The sequence to structure annotation gap 
motivates development of high-throughput sequence-based predictors of the ATP-binding residues. Moreover, our 
empirical tests show that the only existing predictor, ATPint, is characterized by relatively low predictive quality. 

Methods: We propose a novel, high-throughput machine learning-based predictor, ATPsite, which identifies ATP- 
binding residues from protein sequences. Our predictor utilizes Support Vector Machine classifier and a 
comprehensive set of input features that are based on the sequence, evolutionary profiles, and the sequence- 
predicted structural descriptors including secondary structure, solvent accessibility, and dihedral angles. 

Results: The ATPsite achieves significantly higher Mathews Correlation Coefficient (MCC) and Area Under the ROC 
Curve (AUC) values when compared with the existing methods including the ATPint, conservation-based rate4site, 
and alignment-based BLAST predictors. We also assessed the effectiveness of individual input types. The PSSM 
profile, the conservation scores, and certain features based on amino acid groups are shown to be more effective 
in predicting the ATP-binding residues than the remaining feature groups. 

Conclusions: Statistical tests show that ATPsite significantly outperforms existing solutions. The consensus of the 
ATPsite with the sequence-alignment based predictor is shown to give further improvements. 
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Background 

Adenosines-triphosphate (ATP) is a multi-functional 
nucleotide that plays an important role in energy meta- 
bolism, signaling, and replication and transcription of 
DNA. As of July 2010, 3860 structures in the Protein 
Data Bank (PDB) [1], which constitute about 6% of 
known protein structures, are annotated as ATP bind- 
ing. The ATP binding sites are regarded as valuable 
drug targets for antibacterial and anti-cancer che- 
motherapy [2,3]. Therefore, the protein- ATP interac- 
tions are of significant interest. 

Past two decades observed a substantial effort in iden- 
tification of conserved characteristics of the ATP- 
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binding sites. Most of these approaches are based on a 
relatively simple analysis of ATP-binding sequences and 
structures that led to identification of sequence motifs 
and structural templates. For instance, the p-loop motif 
that interacts with ATP and its analogs was found in 
several protein families [4] and structural templates that 
interact with either adenosine or phosphates (the two 
chemical groups of ATP) were proposed [5,6]. However, 
these motifs/templates are usually confined to one or 
several protein families and cover only a small subset of 
the ATP-binding sites. 

The large number of protein sequences which lack 
tertiary structure motivates development of computa- 
tional tools for high-throughput sequence-based annota- 
tion of ATP-binding residues. At the same time, to the 
best of our knowledge, ATPint [7] is the only sequence- 
based predictor of the ATP-binding residues. We 
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propose a novel method, named ATPsite, which aims to 
improve over the predictive quality of ATPint and other 
popular ways to annotate hinging residues, including 
sequence alignment and conservation scoring. In con- 
trast to the ATPint, which only takes PSSM profile and 
sequence descriptors as the inputs, the ATPsite uses a 
comprehensive set of relevant inputs. These inputs, 
which include PSSM profile, sequence descriptors, con- 
servation scores, and predicted secondary structure, rela- 
tive solvent accessibility (RSA), and dihedral angles, are 
encoded into a set of custom-designed features that are 
shown to improve the quality of the ATP-binding 
predictions. 

Methods 

Dataset 

We extracted all complexes in PDB (as of February 
2010) that include ATP. The maximal pairwise sequence 
identity of the resulting protein chains was reduced to 
40% with CD-hit [8]. The remaining 227 chains that 
interact with ATP constitute the dataset used in this 
study. Similar to the annotation of DNA-binding resi- 
dues and residues interacting with small ligands [9,10], a 
given residue is annotated as ATP-binding if at least 
one of its non-hydrogen atom is less than 3.9A away 
from a non-hydrogen atom of the ATP molecule; our 
dataset includes 3393 ATP-binding residues and 80409 
non-binding residues and is available at http://biomine. 
ece.ualberta.ca/ ATPsite/. 

Evaluation criteria and test procedure 

We use 5-fold cross validation to assess the predictions. 
We evaluated predictions at two levels: i) the binary 
value that defines whether a given residue does or does 
not bind to ATP; and ii) the real value that quantifies 
the probability of binding to ATP. The binary predic- 
tions were assessed using four measures 

Sensitivity = TP / (TP + FN) Specificity = TN / (FP + TN) 

kA ^ TPxTN -FPxFN 

MCC = 



Accuracy 



^(TP+FP) x (TP+FN) x (TN+FP) x (TN+FN) 
TP + TN 



TP + FP + TN + FN 



where TP (true positives) and TN (true negatives) are 
the counts of correctly predicted binding and non-bind- 
ing residues, respectively, FP (false positives) are non- 
binding residues that were predicted as binding residues, 
and FN (false negatives) are binding residues that were 
predicted as non-binding residues. The Matthews corre- 
lation coefficient (MCC) ranges between -1 and 1 and it 
equals zero when all residues are predicted as binding 
or non-binding. Higher MCC value indicates better 
predictions. 



The receiver operating characteristic (ROC) curve was 
used to examine the predicted probabilities. For each 
value of probability p achieved by a given method 
(between 0 and 1), the residues with probability > p are 
set as the binding residue, and all other residues are set 
as the non-binding residue. Next, the TP-rate and the 
FP-rate are calculated and we use the area under the 
curve (AUC) to quantify the predictive quality. 

We analyze statistical significance of the differences in 
the MCC and AUC values between predictions gener- 
ated by ATPsite and the other methods. The MCC 
values are available for all methods while the AUC value 
cannot be calculated for an alignment-based predictor. 
These values are calculated per sequence (using the 
cross-validated predictions) for each method and we 
compare them using a paired Wilcoxon rank sum test 
at 0.01 significance. This non-parametric test is used 
since the per sequence MCC and AUC values do not 
follow normal distribution, as tested using Shapiro-Wilk 
test at the 0.05 significance. 

Architecture of the proposed predictor 

The architecture of the ATPsite predictor is shown in 
Figure 1. For a given protein sequence, PSIPRED [11] is 
used to predict the secondary structure, REAL Spine3 
[12] predicts the RSA values and dihedral angles, and 
PSIBLAST [13] generates the PSSM profile. In addition 
to these inputs, other features such as conservation 
scores and amino acid (AA) groups are directly calcu- 
lated from the sequence. Feature selection was per- 
formed to remove irrelevant and redundant features. 
The selected features are fed into a Support Vector 
Machine (SVM), which is implemented using LIBSVM 
[14], to generate probabilities of ATP-binding. Selection 
of this classifier is motivated by its successful use in the 
ATPint [7]. 

Feature-based representation 

The ATPsite utilizes both sequence and predicted struc- 
ture descriptors as inputs. We utilize a sliding window 
of size 17 centered at the predicted residue to extract 
the input features. This window size was suggested in 
[7] to yield favorable predictive performance. The con- 
sidered features include: 



PSIPRED 



REAL Spine 3 



PSI-BLAST 



input 
* features ~ 



SVM 
classifier 



predicted 
* probability of 
ATP-binding 



Figure 1 Architecture of the ATPsite predictor. 
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- Predicted secondary structure generated by PSIPRED 
[11]. We use probabilities of the 3 secondary structure 
states for each residue in the window. 

- Predicted relative solvent accessibility generated by 
Real-SPINE3 [12]. We use the real values, which quan- 
tify the fraction of the surface area of a given residue 
that is accessible to the solvent, for the residues in the 
window. 

- Predicted dihedral angles generated by Real-SPINE3 
[12]. We utilize two real values, which represent phi 
(involving the backbone atoms C-N-C a -C) and psi 
(involving the backbone atoms N-C a -C-N) angles. 

- PSSM profile generated by PSIBLAST [13] with 
default parameters. We normalize these inputs with 1/(1 
+2~ x ), where x is the raw value from the PSSM profile; 
this transformation is commonly used in secondary 
structure prediction. For a window centered at R t residue 
at i th position, we calculate 17x20 features f i+kf j where k= 
-8, -7,... ,7, 8 is the index of the position in the window 
and 7=1,2,. ..,20 is the index of the PSSM column. We 
averaged values on the left and right sides of the central 
residue gi +Zt j=(fi +Zt j+fi- Zt j)/2 where z=0,l,. ..,8. As a result, 
the original 17x20 values are transformed to 9x20 values. 

- AA groups including hydrophobic residues (Ala, Cys, 
He, Leu, Met and Val), negatively charged (Asp and 
Glu), positively charged (His, Lys, Arg) and carboxa- 
mide-containing AAs (Asn and Gin). 

- Terminal indicator is set to 1 for the first and last 3 
residues in the sequence and 0 for the other positions. 

- Secondary structure segment indicator for helix/ 
strand/ coil predictions from PSIPRED on both sides of 
the window is calculated. If 4 (3) consecutive residues 
on the left/right side of the window are predicted as 
helix (strand), we set the helix (strand) indicator as 1 for 
the left/right side. If both helix and strand indicators are 
0, then the coil indicator is set as 1. 

- Residue conservation scores are calculated based on 
the Shannon entropy (referred to as conservation A) 
and two other formulas proposed in [15,16] (named 
conservation B and C, respectively) which incorporate 
the background frequency of the amino acids. 

- Collocation of AA pairs[17,18] is calculated for the 
residues in the window, which is motivated by results for 
membrane proteins where certain AA pairs are over- 
represented [19]. Similarly, several sequence motifs occur 
frequently at the ATP binding sites. To accommodate for 
mutations in these motifs, we use collocated AA pairs 
(pairs with gaps) to characterize these motifs. We only 
consider pairs formed between the central residue in the 
window and another residue up to 5 positions away. This 
results in 20x20x10=4000 frequencies (for 20 AA types 
and 10 positions; 5 on each side). The same as in the 
work for the membrane proteins [19], ^-values that indi- 
cate the significance of the association between an AA 



pair and ATP-binding annotation are calculated. An AA 
pair with low p-vdlue indicates a low probability that the 
association between this pair and ATP-binding is a coin- 
cidence. When analyzing 4000 randomly distributed vari- 
ables, we expect to observe by chance one instance of a 
difference from expected value with significance p < 
0.00025 (1/4000). We exclude the AA pairs with p > 10" 6 , 
since based on the Engelmans study [19] their associa- 
tion with ATP-binding event would be random. 

We note that the terminal and secondary structure 
indicators, collocation of AA pairs, and the predicted 
secondary structure, relative solvent accessibility, and 
dihedral angles were never before used to predict the 
ATP-binding residues. 

Feature selection and parameterization 

We use 5-fold cross validation to compute feature selec- 
tion and evaluation. The dataset is randomly divided 
per-sequence into 5 folds, of which 4 are used for train- 
ing and the one for testing; each of the 5 folds is used 
once as the test fold. This procedure assures that anno- 
tations of ATP binding from test folds are not used to 
train the predictive model. Biserial correlation is calcu- 
lated between each of the features and the binary anno- 
tation of ATP-binding residues for each of the 5 
training sets. The averaged, over the 5 training sets, cor- 
relation values were used to rank the features. We used 
a best first forward feature selection. Given a feature list 
F=\f b i=l,2,...,n], sorted by the average correlation in the 
descendent order, and an empty list S consisting of 
selected features, in each round we add the top-ranked 
feature from F to S and run default SVM with linear 
kernel and complexity constant C=l on the feature set S 
(using 5-fold cross validation). If addition of a given fea- 
ture improves the average AUC value over the 5 test 
folds, this feature is retained in S; otherwise it is 
removed. We repeat that until F is empty. 

The SVM is parameterized using the selected feature 
set S which includes 96 features. The features are sum- 
marized in Table 1. All parameterization steps maximize 



Table 1 Summary list of the selected features. 



Feature group 


# of selected features 


Predicted secondary structure 


12 


Predicted relative solvent accessibility 


3 


Predicted dihedral angles 


6 


PSSM profile 


35 


AA groups 


11 


Terminal indicator 


0 


Secondary structure segment indicator 


1 


Residue conservation scores 


19 


Collocation of AA pairs 


9 


Total 


96 
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the average cross-validated AUC value. We considered 
polynomial and RBF kernels. For polynomial kernel, C is 
initially fixed at 1 and the degree of the polynomial is 
adjusted between 0.5 and 5. The degree that results in 
the highest average AUC is selected and next we adjust 
C with consecutive powers of 2 between 2" 3 and 2 5 . 
Similarly for the RBF kernel, the gamma parameter is 
first optimized when C is fixed at 1, and next C is 
adjusted using the 2" 3 and 2 5 range. The threshold used 
to binarize the probabilities is set to maximize the MCC 
value of the predictions on the training folds. A residue 
is classified as ATP-binding if its predicted probability > 
0.182, which is the averaged threshold that maximizes 
the MCC of our method over the 5 training folds; other- 
wise it is categorized as non-binding. 

Baseline predictors 

ATPsite is compared with the existing ATPint as well as 
three baseline predictors that implement commonly 
used approaches to find binding residues: 

- Rate4site[20] predicts functional sites by finding 
conserved residues. We first run PSI-Blast using the 
query sequence against the NCBI non-redundant data- 
base. For chains with at least 3 significant matches, we 
created alignments of the best 50 sequences (the default 
for Consurf [21], which is the web version of rate4site) 
using ClustalW [22] and we inputted them to rate4site. 
The rate4site generates conservation score for each resi- 
due and residues with lower scores (indicating a higher 
conservation) have higher probability to be binding resi- 
dues. We use these scores to compute ROC curves and 
the threshold that maximizes the MCC value is used to 
binarize the conservation scores. 

- Sequence alignmentusing BLAST identifies similar 
sequences or segments from a given annotated (with 
ATP-binding residues) dataset for a query sequence. 
This approach predicts the binding residues by using 
the ATP-binding annotations from the best aligned 
sequence. We execute the BLAST-based alignment 
between a query sequence and all other sequences 
(except the query sequence itself) in the benchmark 
dataset. The sequence with lowest E-value is selected 
as the template. The residues in the query sequence 
that were aligned with the binding residues on the 
template chain are predicted as the ATP-binding 
residues. 

- PSSM profile is widely used in related sequence- 
based predictors, including the ATPint predictor [7]. To 
validate the effectiveness of the features proposed in this 
work, we build a simple predictor that uses SVM (which 
used the same parameters as the SVM in ATPsite) and 
takes PSSM profile as the input. This allows estimation 
of the improvements provided by the new features. 



Results 

Comparison with existing methods 

The ATPsite predictor is compared with ATPint (we 
used the web server at http://www.imtech.res.in/ 
raghava/atpint/) and the three baseline predictors based 
on the alignment, conservation scoring and evolutionary 
profiles, see Table 2. ATPsite is shown to outperform 
other methods. It improves the AUC and MCC by 0.03 
and 0.07 when compared with the runner-up PSSM- 
based and BLAST-based predictors, respectively. These 
improvements are statistically significant with p < 0.01. 
ATPsite also achieved the highest (tied with BLAST) 
accuracy and the second best sensitivity and specificity. 
Since rate4site only considers residue conservation, it 
likely finds binding residues for other ligand types, cata- 
lytic residues, and key residues that are critical for pro- 
tein folding. This explains the high sensitivity and low 
specificity, relative to the other methods, achieved by 
the rate4site. The alignment-based predictions have the 
best specificity and low sensitivity. This shows that 
some ATP-binding residues are in conserved sequence 
segments, but the sequence conservation can accommo- 
date only for a fraction of the ATP-binding residues. 
The predictive quality of ATPint is lower than it was 
originally reported [7]. The authors of ATPint balanced 
the number of binding and non-binding residues in 
their study which likely lead to overestimation of the 
predictive quality when applied to full protein chains. 
Importantly, applications of the new features results in 
significant improvements as demonstrated by the differ- 
ences between ATPsite and PSSM+SVM methods. The 
ROC curves are shown in Figure 2. The BLAST-based 
method does not provide probabilities and thus it is 
represented by a point that corresponds to the binary 
predictions. Figure 2 focuses on the FP rates < 0.05 

Table 2 Comparison between ATPsite, ATPint and three 
baseline predictors that use alignment (BLAST), 
conservation scoring (rate4site) and evolutionary profiles 
(PSSM+SVM). The "Significance" column reports 
statistical significance tests that compare paired per- 
sequence AUC and MCC between ATPsite and other 
methods; + indicates that ATPsite is significantly better 
at 0.01 level. 



Predictor Predicted Predicted binary annotations 

probabilities 





AUC Significance 


SENS 


SPEC 


ACC 


MCC 


Significance 


ATPsite 


0.854 


0.361 


0.988 


0.962 


0.433 




Rate4site 


0.749 + 


0.446 


0.87 


0.852 


0.182 


+ 


PSSM 


0.824 + 


0.354 


0.957 


0.933 


0.27 


+ 


+SVM 














BLAST 


Not applicable 


0.243 


0.993 


0.962 


0.359 


+ 


ATPint 


0.627 + 


0.539 


0.651 


0.648 


0.078 


+ 
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Figure 2 ROC curves for ATPsite, ATPint, rate4site and the 
predictor based on PSSM with SVM classifier. The FP-rate is 
constrained to [0, 0.05] range and the BLAST-based solution is 
shown using a single point that corresponds to the binary 
predictions. The full ROC curve can be found in the Supplementary 
Figure 1 in the online supplement at http://biomine.ece.ualberta.ca/ 
ATPsite/. 



TP rate AA groups 

0.5 t — * — AA pairs 

Predicted dihedral angles 
-a — Conservation A 
-i — Conservation B 
Conservation C 
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0.3 -- 
0.25 -- 

0.2 
0.15 -- 



— t> - Secondary structure indicator 

Terminal indicator 

Predicted RSA 

— °— Predicted secondary structure 




0.01 



0.02 



0.03 



0.04 



rate 



0.05 



Figure 3 ROC curves calculated based on predictions 
generated using individual input types. The FP-rate is 
constrained to [0, 0.05]. The full ROC curve can be found in the 
Supplementary Figure 2 in the online supplement at http://biomine. 
ece.ualberta.ca/ ATPsite/. 



since only about 4% of residues bind to ATP. The full 
ROC is given in the Supplementary Figure 1 in the sup- 
plement at http://biomine.ece.ualberta.ca/ATPsite/. The 
ATPsite achieves higher TP rates for the low FP rates 
consistently outperforming other solutions. 

Effectiveness of individual input types 

We assessed the ability of individual input types to pre- 
dict the ATP-binding residues. We use the same SVM 
classifier but we utilize the selected features from one 
input source at the time, see Table 3. The correspond- 
ing ROC curves are shown in Figure 3 (for FP rates < 

Table 3 Predictive quality achieved with individual input 
types; the inputs are sorted in the descending order 
using the AUC values. 



Input type 


# 

features 


AUC 


SENS 


SPEC 


ACC 


MCC 


PSSM profile 


180 


0.824 


0.354 


0.957 


0.933 


0.270 


AA groups 


36 


0.785 


0.390 


0.924 


0.902 


0.218 


Conservation B 


17 


0.745 


0.470 


0.860 


0.844 


0.181 


Conservation A 


17 


0.744 


0.427 


0.892 


0.873 


0.194 


Conservation C 


17 


0.741 


0.564 


0.792 


0.783 


0.170 


Pred. secondary structure 


51 


0.732 


0.651 


0.720 


0.717 


0.161 


Sec. str. segment 
indicator 


6 


0.706 


0.656 


0.683 


0.682 


0.143 


Pred. dihedral angles 


34 


0.696 


0.656 


0.637 


0.638 


0.120 


Collocation of AA pairs 


153 


0.671 


0.182 


0.990 


0.957 


0.261 


Pred. solvent accessibility 


17 


0.628 


0.194 


0.918 


0.889 


0.079 


Terminal indicator 


1 


0.517 


0.989 


0.045 


0.084 


0.033 



0.05) and in full in the Supplementary Figure 2 in the 
online supplement at http://biomine.ece.ualberta.ca/ 
ATPsite/. Among the 11 feature groups, the PSSM pro- 
file achieves the highest AUC, followed by the AA 
groups and conservation scores. The 3 different types of 
scoring functions that we tried provide comparable pre- 
dictive value. The AUC values for features based on pre- 
dicted secondary structure and dihedral angles are above 
0.7. Importantly, predictions obtained with these indivi- 
dual feature sets have substantially lower quality than 
predictions obtained with the combined set of features, 
which support the utility of our solution that uses and 
combines a set of novel inputs. 

Consensus-based predictions 

The machine-learning based ATPsite, alignment-based 
BLAST, and conservation-based rate4site utilize differ- 
ent and potentially orthogonal approaches, which moti- 
vates building consensus methods that would exploit 
this potential complementarity. We built 4 simple inclu- 
sive disjunction type ensembles of these 3 predictors 
(including 3 pairs of methods and all 3 methods 
together). For instance, the consensus of ATPsite and 
rate4site predicts a residue as ATP-binding if any of 
these two methods predicts this residue as binding; the 
corresponding probability equals to the maximal prob- 
abilities generated by the component predictors. The 
conservation scores of rate4site are linearly normalized 
to the [0, 1] range. We compare the ROC curves, AUC 
and MCC values, of the consensus and standalone 
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predictors, see Figure 4 and Table 4 in the main text 
and the Supplementary Figure 3 in the online supple- 
ment at http://biomine.ece.ualberta.ca/ATPsite/. Only 
the consensus of ATPsite and BLAST improves the 
AUC and MCC values of the best standalone ATPsite. 
The consensus achieves AUC = 0.861 and MCC = 0.46 
compared to 0.854 and 0.433 obtained with ATPsite, 
respectively. This suggests that the predictions from the 
ATPsite and the BLAST-based predictor are comple- 
mentary. The likely reason why rate4site does not help 
is since ATPsite already utilizes conservation scores, 
although they are computed in a different way. 

The predicted probability implies confidence 

We investigate whether the probabilities of the ATP- 
binding generated by ATPsite could be used as confi- 
dence scores for the binary predictions. We binned the 
residues into twenty 0.05 wide intervals based on their 
predicted probabilities and we computed the corre- 
sponding average accuracy and percentage of the resi- 
dues in each of these bins, see Figure 5. For residues 
with high, >0.9, or low, <0.1, probabilities, the accuracies 
are higher than for the remaining residues. The aver- 
aged accuracy for 0.6% residues (including 14% of the 
ATP-binding residues) that are predicted with probabil- 
ity > 0.9 equals 0.85, and for the 94% of residues with 
probability < 0.1 it equals 0.98. 

Case study 

We demonstrate predictions using chain A of phos- 
phofructokinase6 (PDBid 3CQD). The predictive 




0 0.01 0.02 0.03 0.04 0.05 

Figure 4 ROC curves of the ensemble predictors. The FP-rate is 
constrained to [0, 0.05]. The full ROC curve can be found in the 
Supplementary Figure 3 in the online supplement at http://biomine. 
ece.ualberta.ca/ATPsite/. 

v / 



Table 4 Comparison between 4 ensemble predictors and 
their component predictors. 



Method 


AUC 


MCC 


ATPsite 


0.854 


0.433 


BLAST 


NA 


0.359 


Rate4site 


0.749 


0.182 


ATPsite+BLAST 


0.861 


0.460 


ATPsite+Rate4site 


0.763 


0.260 


BLAST+Rate4site 


0.779 


0.234 


ATPsite+BLAST+Rate4site 


0.797 


0.420 



quality of the considered methods for this target is 
similar to their average quality on the entire dataset. 
The native ATP-binding residues, the binary predic- 
tions, and the probabilities predicted by ATPsite, 
rate4site, BLAST and PSSM+SVM methods are shown 
in Figure 6. The conservation scores from rate4site 
were divided by 10 to fit the Figure. The native ATP- 
binding residues are clustered into four segments, resi- 
due 185 to 189, 224 to 229, 248 to 258 and 280 to 
287. ATPsite captured the four segments and correctly 
predicted 9 out of the 17 binding residues with 4 false 
positives (FPs). The PSSM profile-based predictor 
missed the last binding segment and successfully iden- 
tified 7 out of the 17 binding residues with 23 FPs. 
The BLAST-based predictor missed two of the binding 
segments and correctly predicted 8 of the 17 binding 
residues with 7 FPs. The rate4site found the four seg- 
ments and successfully identified 12 binding residue, 
however, it also produced 81 FPs. 

Conclusions 

We developed a new method, ATPsite, for the 
sequence-based prediction of ATP-binding residues. 
Our predictor is empirically shown to outperform the 




probability generated by ATPsite 



Figure 5 The distribution of average accuracies (shown using 
bars) for residues binned into twenty 0.05 wide intervals 
based on their probability of ATP-binding predicted by 
ATPsite. The percentage of the residues in each bin is shown 
above the bars. 
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• PSSIVH-SVM binary • ATPsite binary • Rate4site binary • BLAST binary 

PSSIVH-SVM pro b ATPsite_prob Rate4site prob • Native binding residues 

PSSIVH-SVM cut-off ATPsite cut-off Rate4site cuto-off 




Figure 6 Comparison of predictions for chain A of phosphofructokinase6 (PDB id 3CQD). Solid lines at the top show the predicted 
probabilities for PSSM+SVM (in red), rate4site (green), and ATPsite (blue). The dashed lines denote cut-offs used to binarize the probabilities and 
the corresponding binary predictions are shown using dots (one dot per residue) at the bottom. Black dots denote native ATP-binding residues, 
and blue, red, green and gray denote predictions from ATPsite, PSSM+SVM, rate4site and BLAST, respectively. 

V ) 



existing approaches. These improvements are attributed 
to the usage of a novel and comprehensive set of input 
features, which include both sequence and predicted 
structural descriptors. We also found that a simple con- 
sensus of ATPsite with BLAST-based method leads to 
additional improvements. The consensus-based predic- 
tor achieves AUC = 0.861 and MCC = 0.46, which 
demonstrates that these predictions provide useful infor- 
mation for the high-throughput, sequence-based annota- 
tion of the ATP-binding residues. 
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